-
Notifications
You must be signed in to change notification settings - Fork 683
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
grandpa: handle error from SelectChain::finality_target #5153
grandpa: handle error from SelectChain::finality_target #5153
Conversation
See humanode-network/humanode#1104 (comment) and humanode-network/humanode#1104 (comment). I didn't think through all the consequences of this, but I think this is correct. This is what was happening before paritytech/substrate#13289, since we were just calling If someone could add a regression test for this I would appreciate it. Feel free to push to this PR or just create a new one. |
Will add the test. |
d401036
to
b1447b9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good on the first sight, but let me reconcile this logic with the LongestChain::best_chain implementation as there might be more surprises there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good on the first sight, but let me reconcile this logic with the LongestChain:: best_chain implementation as there might be more surprises there
That was quite trivial; there is still a race possible there between the invocation of the best_header
in the finality_target
and best_chain
- but I think this will not stall the consensus.
@dmitrylavrenov do you already have the test somewhere? |
Can prepare new PR to this one soon |
Did you now changed it so that it fails without the fix? |
Yep, andresilva#1 |
@dmitrylavrenov @MOZGIII thanks for debugging and providing the test. |
The test has been check on my end
|
Indeed, that is what paritytech/substrate#13364 solved. There is no way to get around this race condition other than unifying |
Yeah! I can only second this! Really good work. |
dc4047c
Fix #3487. --------- Co-authored-by: Dmitry Lavrenov <39522748+dmitrylavrenov@users.noreply.github.com> Co-authored-by: Bastian Köcher <info@kchr.de> Co-authored-by: Bastian Köcher <git@kchr.de>
) Fix paritytech#3487. --------- Co-authored-by: Dmitry Lavrenov <39522748+dmitrylavrenov@users.noreply.github.com> Co-authored-by: Bastian Köcher <info@kchr.de> Co-authored-by: Bastian Köcher <git@kchr.de>
We just deployed this to our testnet as part on updating from
We are still investigating the underlying issue, just wanted to share this here in case anybody has an idea. We currently still hope it has to do with the stalled finality or the way we configured our grandpa/node. It should also be noted that we can no longer revert even a single finalized block on our node, as it complains that the block to be reverted is part of the chain but can not be found going backwards from any of the leaves, which also suggest that the node assumes we are on a dead branch of some sort. EDIT: Latest theory is that our finality lag just hit the prunning window and the real trigger was the nodes restarting. But the numbers do not really add up. And I always assumed only finalized blocks are prunned, unless maybe this patch broke that... |
Thanks for sharing! Could you provide more details on the initial conditions were when you've started the deployment? Was the consensus stalled while the update was conducted? |
This actually looks like a different issue; is seems like a quite long chain of unfinalized blocks would be evicted by the finality round that is about to run - to that end that the block that the system is about to to vote for was very long time ago and got pruned. Could be all sorts of issues:
One thing I still don't get though is why don't we have the ability to set a delay of a couple of blocks for prevote target selection and casting. It seems like the configured finality delay (2 blocks usually) is completely ignored when the prevotes are being casted, as the very nature of the root cause of this could manifest so potently is because the longest chain selection rule didn't converge the chain naturally, which is what's the purpose of the longest chain rule is. In short, maybe there is another issue to report here. |
How big is your finality lag? |
Fix #3487.